{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# TXT Loader\n",
    "\n",
    "- Author: [seofield](https://github.com/seofield)\n",
    "- Peer Review : [Kane](https://github.com/HarryKane11), [Suhyun Lee](https://github.com/suhyun0115)\n",
    "- Proofread : [JaeJun Shim](https://github.com/kkam-dragon)\n",
    "- This is a part of [LangChain Open Tutorial](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial)\n",
    "\n",
    "[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/LangChain-OpenTutorial/LangChain-OpenTutorial/blob/main/06-DocumentLoader/08-TXTLoader.ipynb)[![Open in GitHub](https://img.shields.io/badge/Open%20in%20GitHub-181717?style=flat-square&logo=github&logoColor=white)](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial/blob/main/06-DocumentLoader/08-TXTLoader.ipynb)\n",
    "## Overview\n",
    "\n",
    "This tutorial focuses on using LangChain’s ```TextLoader``` to efficiently load and process individual text files. \n",
    "\n",
    "You’ll learn how to extract metadata and content, making it easier to prepare text data.\n",
    "\n",
    "\n",
    "### Table of Contents\n",
    "\n",
    "- [Overview](#overview)\n",
    "- [Environement Setup](#environment-setup)\n",
    "- [TXT Loader](#txt-loader)\n",
    "- [Automatic Encoding Detection with TextLoader](#automatic-encoding-detection-with-textloader)\n",
    "\n",
    "----"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Environment Setup\n",
    "\n",
    "Set up the environment. You may refer to [Environment Setup](https://wikidocs.net/257836) for more details.\n",
    "\n",
    "**[Note]**\n",
    "- ```langchain-opentutorial``` is a package that provides a set of easy-to-use environment setup, useful functions and utilities for tutorials. \n",
    "- You can checkout the [```langchain-opentutorial```](https://github.com/LangChain-OpenTutorial/langchain-opentutorial-pypi) for more details."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%capture --no-stderr\n",
    "%pip install langchain-opentutorial"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Install required packages\n",
    "from langchain_opentutorial import package\n",
    "\n",
    "package.install(\n",
    "    [\n",
    "        \"langchain\",\n",
    "        \"langchain_community\",\n",
    "        \"chardet\"\n",
    "    ],\n",
    "    verbose=False,\n",
    "    upgrade=False,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## TXT Loader\n",
    "\n",
    "Let’s explore how to load files with the **.txt** extension using a loader."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Number of documents: 1\n",
      "\n",
      "[Metadata]\n",
      "\n",
      "{'source': 'data/appendix-keywords.txt'}\n",
      "\n",
      "========= [Preview - First 500 Characters] =========\n",
      "\n",
      "Semantic Search\n",
      "\n",
      "Definition: Semantic search is a search method that goes beyond simple keyword matching by understanding the meaning of the user’s query to return relevant results.\n",
      "Example: If a user searches for “planets in the solar system,” the system might return information about related planets such as “Jupiter” or “Mars.”\n",
      "Related Keywords: Natural Language Processing, Search Algorithms, Data Mining\n",
      "\n",
      "Embedding\n",
      "\n",
      "Definition: Embedding is the process of converting textual data, such as words\n"
     ]
    }
   ],
   "source": [
    "from langchain_community.document_loaders import TextLoader\n",
    "\n",
    "# Create a text loader\n",
    "loader = TextLoader(\"data/appendix-keywords.txt\", encoding=\"utf-8\")\n",
    "\n",
    "# Load the document\n",
    "docs = loader.load()\n",
    "print(f\"Number of documents: {len(docs)}\\n\")\n",
    "print(\"[Metadata]\\n\")\n",
    "print(docs[0].metadata)\n",
    "print(\"\\n========= [Preview - First 500 Characters] =========\\n\")\n",
    "print(docs[0].page_content[:500])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Automatic Encoding Detection with TextLoader\n",
    "\n",
    "In this example, we explore several strategies for using the TextLoader class to efficiently load large batches of files from a directory with varying encodings.\n",
    "\n",
    "To illustrate the problem, we’ll first attempt to load multiple text files with arbitrary encodings.\n",
    "\n",
    "- ```silent_errors```: By passing the ```silent_errors``` parameter to the ```DirectoryLoader```, you can skip files that cannot be loaded and continue the loading process without interruptions.\n",
    "- ```autodetect_encoding```: Additionally, you can enable automatic encoding detection by passing the ```autodetect_encoding``` parameter to the loader class, allowing it to detect file encodings before failing.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain_community.document_loaders import DirectoryLoader\n",
    "\n",
    "path = \"data/\"\n",
    "\n",
    "text_loader_kwargs = {\"autodetect_encoding\": True}\n",
    "\n",
    "loader = DirectoryLoader(\n",
    "    path,\n",
    "    glob=\"**/*.txt\",\n",
    "    loader_cls=TextLoader,\n",
    "    silent_errors=True,\n",
    "    loader_kwargs=text_loader_kwargs,\n",
    ")\n",
    "docs = loader.load()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The **data/appendix-keywords.txt** file and its derivative files with similar names all have different encoding formats.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['data/appendix-keywords-CP949.txt',\n",
       " 'data/appendix-keywords-EUCKR.txt',\n",
       " 'data/appendix-keywords.txt',\n",
       " 'data/appendix-keywords-utf8.txt']"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "doc_sources = [doc.metadata[\"source\"] for doc in docs]\n",
    "doc_sources"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[Metadata]\n",
      "\n",
      "{'source': 'data/appendix-keywords-CP949.txt'}\n",
      "\n",
      "========= [Preview - First 500 Characters] =========\n",
      "\n",
      "Semantic Search\n",
      "\n",
      "Definition: Semantic search is a search method that goes beyond simple keyword matching by understanding the meaning of the user¡¯s query to return relevant results.\n",
      "Example: If a user searches for ¡°planets in the solar system,¡± the system might return information about related planets such as ¡°Jupiter¡± or ¡°Mars.¡±\n",
      "Related Keywords: Natural Language Processing, Search Algorithms, Data Mining\n",
      "\n",
      "Embedding\n",
      "\n",
      "Definition: Embedding is the process of converting textual data, such a\n"
     ]
    }
   ],
   "source": [
    "print(\"[Metadata]\\n\")\n",
    "print(docs[0].metadata)\n",
    "print(\"\\n========= [Preview - First 500 Characters] =========\\n\")\n",
    "print(docs[0].page_content[:500])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[Metadata]\n",
      "\n",
      "{'source': 'data/appendix-keywords-EUCKR.txt'}\n",
      "\n",
      "========= [Preview - First 500 Characters] =========\n",
      "\n",
      "Semantic Search\n",
      "\n",
      "Definition: Semantic search is a search method that goes beyond simple keyword matching by understanding the meaning of the user¡¯s query to return relevant results.\n",
      "Example: If a user searches for ¡°planets in the solar system,¡± the system might return information about related planets such as ¡°Jupiter¡± or ¡°Mars.¡±\n",
      "Related Keywords: Natural Language Processing, Search Algorithms, Data Mining\n",
      "\n",
      "Embedding\n",
      "\n",
      "Definition: Embedding is the process of converting textual data, such a\n"
     ]
    }
   ],
   "source": [
    "print(\"[Metadata]\\n\")\n",
    "print(docs[1].metadata)\n",
    "print(\"\\n========= [Preview - First 500 Characters] =========\\n\")\n",
    "print(docs[1].page_content[:500])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[Metadata]\n",
      "\n",
      "{'source': 'data/appendix-keywords-utf8.txt'}\n",
      "\n",
      "========= [Preview - First 500 Characters] =========\n",
      "\n",
      "Semantic Search\n",
      "\n",
      "Definition: Semantic search is a search method that goes beyond simple keyword matching by understanding the meaning of the user’s query to return relevant results.\n",
      "Example: If a user searches for “planets in the solar system,” the system might return information about related planets such as “Jupiter” or “Mars.”\n",
      "Related Keywords: Natural Language Processing, Search Algorithms, Data Mining\n",
      "\n",
      "Embedding\n",
      "\n",
      "Definition: Embedding is the process of converting textual data, such as words\n"
     ]
    }
   ],
   "source": [
    "print(\"[Metadata]\\n\")\n",
    "print(docs[3].metadata)\n",
    "print(\"\\n========= [Preview - First 500 Characters] =========\\n\")\n",
    "print(docs[3].page_content[:500])"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "langchain-opentutorial-99wpaVyw-py3.11",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
